home
***
CD-ROM
|
disk
|
FTP
|
other
***
search
/
The Original Shareware 1.1
/
The Original Shareware (WeMake CDs)(Volume 1.1)(CDs, Inc)(1993).iso
/
19
/
madtrb40.zip
/
MSTREES.DOC
< prev
next >
Wrap
Text File
|
1986-05-05
|
13KB
|
310 lines
Implementation of a Generalized
Median Split Tree
by Chris Maeder
CS 577
Introduction
An important task in many computer applications today is the
retrieval of information associated with an identifier (or "key")
from a table of identifier-information pairs. Usually these
applications involve sets of keys whose frequency occurrence have a
highly skewed distribution, sometimes so skewed that only a small
fraction of the keys are used in most of the information retrievals.
Typical applications in which such key-information pairs arise
include dictionary word look-up, thesaurus word look-up, and spelling
checking in text processing applications (Sheil [1] cites that in the
English language, for example, 127 of the most frequent words account
for approximately 50 percent of an average text), telephone directory
information, the recognition of operating system commands, and the
compiler's symbol look-up table (where language keywords and other
commonly used variables names will appear more frequently than other
words).
Most look-up techniques ignore the frequencies of these
key-information pairs, thereby tending to scatter the more frequently
occurring keys uniformly throughout the look-up table. This usually
results in an unacceptably high rate of external memory references,
especially if such table is quite large. Thus it should be apparent
that it is preferable to separate out the more frequently occurring
keys into a separate look-up table that can be quickly searched
without external memory references.
We present here an implementation of a median split tree that was
first presented in Communications of the ACM [2], November, 1978,
that optimizes the look-up of frequently occurring keys.
Generalized Median Split Trees
Median split tree (MST) algorithm provides us with a means for
searching sets of identifiers that have highly skewed frequency
distributions.
An MST is essentially a binary search tree with two key entries per
node--a node key entry and a split key entry. The node key entry is
the most frequently accessed key in the subtree. The split key entry
partitions the remaining keys in the subtree into two equal subsets,
each of which are organized as an MST. The MST uses this lexical
median of a node's descendants as its split value in order to force
the search tree to be balanced, thereby achieving both a space
efficient representation of the tree and high search speed.
A query of the MST begins at the root of the MST, then following the
tree's nodes in a binary search fashion, first comparing the most
frequent key and then the split key to determine which subtree to
proceed downward with. The MST frequently encounters the proper keys
early in a query, and therefore has a lower cost per successful
search than a typical balanced binary search.
We can further generalize the above algorithm, as presented in ACM
Transactions on Programming Languages and Systems [3], January, 1980,
to allow each node to handle more than two key entries per node. A
query into the MST then examines each entry in a node sequentially,
stopping once the proper key has been found. If a match has not
occurred after comparing the key entries in a node, the query then
proceeds down the left or right subtree depending upon the comparison
with the last key in the node. This last key in the node acts as the
median split key, partitioning the remaining look-up keys in the
subtree into equal left and right subtrees.
We can thereby define an m-median split tree to be an MST with m
entries per node (where m>2). Observe that the root of each subtree
in an m-MST contains the m-1 most frequently accessed keys for that
particular subtree, and the median split key which partitions the
remainder of the look-up keys in the subtree.
One should note that when m=1 we have essentially a balanced binary
search tree, when m=2 we have a conventional median split tree, and
when m=n we have essentially a sequential search.
Average and worse case complexity for the construction and query
algorithms for m-MST can be found in the referenced material.
Construction of Median Split Trees
We have developed two construction algorithms for the construction of
the m-MST, but note that only the first construction algorithm was
fully implemented. Both construction algorithms assume that the
stored set of keys have been previously sorted according to their
lexical value.
In both construction algorithms, the file containing the look-up keys
is read and the keys are placed into a doubly linked list,
maintaining the keys in their sorted lexical order.
The first m-MST construction algorithm builds the m-MST using only
the lexical ordering of the given set of keys. The construction
algorithm is applied recursively as it builds the tree, continually
splitting the doubly linked list in half at the median split key
linked list element.
Each m-MST subtree's root node is filled with the m-1 most frequently
occurring keys in that particular subtree. Note that the elements
that belong to that particular subtree are the elements that remain
on the passed linked list. The most frequent elements are selected
and deleted from the linked list using a sequential search.
Next, the median split key is selected from the remaining keys in the
linked list. The split key selection is such that it fills the
current m-MST subtree's child nodes as completely as is possible.
This split key splits (or partitions) the remaining keys into the
node's left and right subtrees, hence the linked list is broken in
half at this split key.
As previously stated, the process described above is applied
recursively until the entire m-MST is built.
To somewhat optimize the above construction algorithm it was decided
to construct the m-MST using both orderings of the given set of keys.
Again, one of the orderings is the actual lexical ordering of the
keys as is supplied in the file of look-up keys. These keys were
then ordered according to their frequency of occurrence using a
quicksort on a duplicated linked list, placing the entries into a
priority queue.
The priority queue is then used to determine the most frequently
occurring keys in the current subtree. Note that we could have also
have implemented a heap instead of a priority queue.
The linked list of look-up keys and the priority queue keys are
linked together using a double pointers. This was done so that when
the next most frequent key was removed from the priority queue, it is
possible to immediately delete that same key from the doubly linked
list.
Problems occurred with this second m-MST construction algorithm when
attempting to devise a means of splitting the doubly linked list at
the split key. Observe that the pointers into the priority queue
from the linked list are very intermixed. We did not come up with a
clean, straight forward method of splitting the priority queue about
the split key element. We therefore stopped with this second
construction implementation after it was discovered that the linked
list and priority queue splitting technique was not going to add much
to the construction speed of the m-MST due to its inherit complexity.
To optimize the look-up algorithm it was decided that the node data
record structure should be that of an array of size
[1..MAX_KEYS_PER_NODE] not including the split key. The split key is
separate element in the node's data structure. With this node data
structure we were then able to sort the node's array of look-up keys
based upon their lexical value.
By sorting the keys in the node array according to their lexical
value allowed us to include in the search algorithm a binary search
for searching within each node, rather than the sequential search as
was proposed in []. Using the binary search within each node
effectively increases the look-up speed of keys, especially when
implemented with large node key arrays. Observe that a interpolation
search could be implemented in place of the binary search if so
desired, but that more expensive numeric calculations are involved.
We report that it only took a couple of seconds to construct a m-MST
with 256 keys on a IBM micro computer using the first construction
algorithm. Since we believe that there will not much improvement in
speed with the second construction algorithm, in actual practice it's
probably better to implement only the first (and far more simpler)
construction algorithm. Also note that the m-MST need only be built
once, then being stored in some form externally when not needed.
We included in our implementation a display representation of the
m-MST once it has been constructed. This display demonstrates how
the more frequent keys are positioned further up in the tree, and how
the split keys partition the left and right subtrees.
Selection of Median Split Keys
Median split trees use a node's split key value as the median for
partitioning (with respect to the lexical ordering) the remaining
nodes into the left and right subtrees. The choice of split value is
very important since it can result in a perfectly balanced tree,
thereby allowing optimal information yield for each node comparison.
Observe that it would be preferable to have each node filled to its
capacity in order to keep the tree as completely full as is possible.
Therefore it was necessary to develop a split key function that
provided a lexical order index that allowed us to select a key
closest to the actual median of the keys, and yet completely fill as
many of the subsequent m-MST nodes as is possible.
The median split key function originally proposed in [2] was altered
to take in account the modification in the search algorithm we have
implemented. The following function is used in order to determine a
split key that gives the key closest to the median (based upon the
lexical ordering of the remaining keys), and yet fills the child
nodes completely.
SplitValue=((MAX_KEYS_PER_NODE+1)*IntegerMultiplier
where SplitValue is chosen to be closest to the actual median of the
remaining keys and IntegerMultiplier is of the form 1,2,3,...
One should observe from our sample run that only one node is not
completely full in the entire m-MST. This is because of the way the
above split key function splits the keys at the m-MST root node so
that all remaining nodes, except possibly the right-most node, will
be filled completely.
Queries into the m-MST
Queries into the m-MST are very straight forward. First the node
array is scanned using a binary search. As was stated previously,
this binary search is an optimization on the given search algorithm.
If after the node array search a match has not been found, the split
key is then checked to determine if it possibly matches the query.
Failing that test a determination is made which of the node's two
subtrees the query should continue to search.
A look-up key comparison counter was included in the search
implementation so that there could be some indication of the number
of comparisons being performed for each m-MST query.
Further Enhancements Considered for the m-MST Algorithm
One possible method that was considered but not implemented was to
map the m-MST keys into an array instead of storing the keys in a
actual tree. This type of storage technique has previously been
implemented to store binary search tree representations of data in
languages that do not support pointers, such as Fortran and Cobol.
Note that an already constructed m-MST can be very quickly read into
an array.
A problem that must be overcome in other array implementations of
search trees is tree balance and completeness. This is especially
difficult when dealing with dynamic data where the tree is
continually growing and shrinking. An array implementation of a
m-MST makes a great deal of sense since it is remains stable (i.e. is
not dynamic), is complete, and is very well balanced.
To access a particular node in the array a node access function
similar to what is used in binary search tree array implementation
could be used for the m-MST.
References
1. Sheil, B. A. Median Split Trees: A Fast Look-up Technique for
Frequently Occurring Keys. Comm. ACM 21, 11 (Nov. 1978), 948.
2. Sheil, B. A. Median Split Trees: A Fast Look-up Technique for
Frequently Occurring Keys. Comm. ACM 21, 11 (Nov. 1978), 947-958.
3. Comer, D. A Note on Median Split Trees. ACM Trans on Prog. Lang.
and Sys. 2, 1 (Jan. 1980), 129-133.
4. Comer, D. A Note on Median Split Trees. ACM Trans on Prog. Lang.
and Sys. 2, 1 (Jan. 1980), 130.
5. Sheil, B. A. Median Split Trees: A Fast Look-up Technique for
Frequently Occurring Keys. Comm. ACM 21, 11 (Nov. 1978), 950.